Efficient Algorithms for Decision Tree Cross-validation (Extended Abstract)
Authors
Abstract
Cross-validation is a generally applicable and very useful technique for many tasks often encountered in machine learning, such as accuracy estimation, feature selection or parameter tuning. A common property of these tasks is that one wants to validate a learned theory on a set of examples not used for its construction (i.e., an "independent test set"). When insufficient data are available to reliably train on one subset of the data and validate the results on a disjoint subset, cross-validation provides a way out. It consists of partitioning a data set D into n subsets D_i and then running a given algorithm n times, each time using a different training set D \ D_i and validating the results on D_i. The results on each D_i are averaged to provide a reliable estimate of the induced model's performance on unseen cases. An often mentioned disadvantage of cross-validation is its computational cost: the learning algorithm needs to be run n times. However, while conceptually this is true, it need not be implemented this way. The purpose of this paper is to show that, in a number of cases, a full cross-validation can be performed with only little overhead over the original induction algorithm. The main contributions of this work are as follows. We show how to extend classical algorithms for decision tree induction [5, 4] in such a way that a full cross-validation is integrated with the induction process at a minimal cost; the key is to observe that in a cross-validation a lot of redundant computations are performed, and by rearranging these computations we can often reuse results instead of recomputing them. We analyse the computational complexity of the novel algorithm, identifying those parameters that influence the overhead most. It turns out that compared to the standard implementation of cross-validation, our method...
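The redundancy-reuse idea described above can be illustrated with a minimal sketch (this is an illustrative assumption, not the authors' exact algorithm): statistics such as class counts, which decision tree induction needs to score candidate splits, are computed once per fold, and the counts for each training set D \ D_i are then obtained by subtracting the held-out fold's counts from the totals instead of rescanning the data n times. The partitioning scheme and helper names below are hypothetical.

```python
from collections import Counter

def fold_class_counts(data, n_folds):
    """Class counts per fold; data is a list of (feature_vector, label) pairs."""
    counts = [Counter() for _ in range(n_folds)]
    for idx, (_, label) in enumerate(data):
        counts[idx % n_folds][label] += 1  # simple round-robin partition D_i
    return counts

def training_counts(total, fold):
    """Counts on D \\ D_i obtained by subtraction, not by rescanning the data."""
    return Counter({c: total[c] - fold.get(c, 0) for c in total})

# Toy data set: six examples, two classes.
data = [((0,), "a"), ((1,), "b"), ((2,), "a"),
        ((3,), "a"), ((4,), "b"), ((5,), "b")]
folds = fold_class_counts(data, n_folds=3)

total = Counter()            # one pass over the fold counts gives the totals
for f in folds:
    total.update(f)

for i, f in enumerate(folds):
    train = training_counts(total, f)
    # `train` is the class distribution of D \ D_i at no extra scan cost
    print(i, dict(train))
```

Each fold's training statistics cost O(number of classes) to derive rather than O(|D|), which is the kind of rearrangement that lets a full cross-validation piggyback on a single induction pass.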
Similar References
Efficient Algorithms for Decision Tree Cross-validation
Cross-validation is a useful and generally applicable technique often employed in machine learning, including decision tree induction. An important disadvantage of straightforward implementation of the technique is its computational overhead. In this paper we show that, for decision trees, the computational overhead of cross-validation can be reduced significantly by integrating the cross-valida...
Full Text
Efficient algorithms for decision tree cross-validation
Cross-validation is a useful and generally applicable technique often employed in machine learning, including decision tree induction. An important disadvantage of straightforward implementation of the technique is its computational overhead. In this paper we show that, for decision trees, the computational overhead of cross-validation can be reduced significantly by integrating the cross-valida...
Full Text
Comparison of Performance in Image Classification Algorithms of Satellite in Detection of Sarakhs Sandy zones
Extended abstract. 1- Introduction. Wind erosion as an "environmental threat" has caused serious problems in the world. Identifying and evaluating areas affected by wind erosion can be an important tool for managers and planners in the sustainable development of different areas. Nowadays there are various methods in the world for zoning lands affected by wind erosion. One of the most important...
Full Text
Ensemble Classification and Extended Feature Selection for Credit Card Fraud Detection
Due to the rise of technology, the possibility of fraud in different areas such as banking has increased. Credit card fraud is a crucial problem in banking and its danger is ever increasing. This paper proposes an advanced data mining method, considering both feature selection and decision cost, for accuracy enhancement of credit card fraud detection. After selecting the best and most effec...
Full Text
Cross-Validated C4.5: Using Error Estimation for Automatic Parameter Selection
Machine learning algorithms for supervised learning are in wide use. An important issue in the use of these algorithms is how to set the parameters of the algorithm. While the default parameter values may be appropriate for a wide variety of tasks, they are not necessarily optimal for a given task. In this paper, we investigate the use of cross-validation to select parameters for the C4.5 decis...
Full Text